Adapting Sequence to Sequence models for Text Normalization in Social Media
Social media offer an abundant source of valuable raw data, however informal
writing can quickly become a bottleneck for many natural language processing
(NLP) tasks. Off-the-shelf tools are usually trained on formal text and cannot
explicitly handle noise found in short online posts. Moreover, the variety of
frequently occurring linguistic variations presents several challenges, even
for humans who might not be able to comprehend the meaning of such posts,
especially when they contain slang and abbreviations. Text Normalization aims
to transform online user-generated text to a canonical form. Current text
normalization systems rely on string or phonetic similarity and classification
models that work in a local fashion. We argue that processing contextual
information is crucial for this task and introduce a social media text
normalization hybrid word-character attention-based encoder-decoder model that
can serve as a pre-processing step for NLP applications to adapt to noisy text
in social media. Our character-based component is trained on synthetic
adversarial examples that are designed to capture errors commonly found in
online user-generated text. Experiments show that our model surpasses neural
architectures designed for text normalization and achieves comparable
performance with state-of-the-art related work.
Comment: Accepted at the 13th International AAAI Conference on Web and Social Media (ICWSM 2019).
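The purely local, string-similarity baselines this abstract argues against can be illustrated with a minimal sketch (the lexicon and inputs below are hypothetical, not from the paper): each token is normalized in isolation via closest-match lookup, with no access to the surrounding context that the proposed encoder-decoder model exploits.

```python
import difflib

# Hypothetical canonical lexicon; a real system would use a large vocabulary.
LEXICON = ["tomorrow", "tonight", "great", "people", "see", "you"]

def normalize_token(token, cutoff=0.6):
    """Map a noisy token to its closest canonical form by string similarity.

    Works purely locally: no surrounding context is consulted, which is
    exactly the limitation a contextual model addresses.
    """
    matches = difflib.get_close_matches(token.lower(), LEXICON, n=1, cutoff=cutoff)
    return matches[0] if matches else token

def normalize(text):
    return " ".join(normalize_token(t) for t in text.split())

# "u" stays unchanged: one-character tokens fall below the similarity cutoff,
# while a contextual model could still resolve it to "you".
print(normalize("see u tmrw tonite"))  # → see u tomorrow tonight
```

This kind of normalizer cannot distinguish, say, "ill" meaning "I'll" from "ill" meaning "sick", since the decision never looks beyond the token itself.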
FixEval: Execution-based Evaluation of Program Fixes for Programming Problems
The increasing complexity of software has led to a drastic rise in time and
costs for identifying and fixing bugs. Various approaches are explored in the
literature to generate fixes for buggy code automatically. However, few tools
and datasets are available to evaluate model-generated fixes effectively due to
the large combinatorial space of possible fixes for a particular bug. In this
work, we introduce FIXEVAL, a benchmark comprising buggy code submissions to
competitive programming problems and their respective fixes. FIXEVAL is
composed of a rich test suite to evaluate and assess the correctness of
model-generated program fixes and further information regarding time and memory
constraints and acceptance based on a verdict. We consider two Transformer
language models pretrained on programming languages as our baselines and
compare them using match-based and execution-based evaluation metrics. Our
experiments show that match-based metrics do not accurately reflect the
correctness of model-generated program fixes, whereas execution-based methods
evaluate programs against all the test cases and scenarios designed for that
problem.
Therefore, we believe FIXEVAL provides a step towards real-world automatic bug
fixing and model-generated code evaluation. The dataset and models are
open-sourced at https://github.com/mahimanzum/FixEval.
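The gap between match-based and execution-based evaluation can be sketched as follows (the toy problem, reference fix, and candidate fix are illustrative, not drawn from the FIXEVAL dataset): a semantically correct fix that differs textually from the reference is rejected by exact match but accepted once the program is actually run against test cases.

```python
def exact_match(candidate: str, reference: str) -> bool:
    """Match-based metric: textual identity with the reference fix."""
    return candidate.strip() == reference.strip()

def passes_tests(candidate: str, test_cases) -> bool:
    """Execution-based metric: run the candidate fix against I/O test cases."""
    namespace = {}
    try:
        exec(candidate, namespace)  # define the candidate's solve() function
        return all(namespace["solve"](inp) == out for inp, out in test_cases)
    except Exception:
        return False

# Toy task: return the maximum of a list.
reference = "def solve(xs):\n    return max(xs)"
# A correct fix that is textually different from the reference.
candidate = (
    "def solve(xs):\n"
    "    best = xs[0]\n"
    "    for x in xs:\n"
    "        if x > best:\n"
    "            best = x\n"
    "    return best"
)
tests = [([3, 1, 2], 3), ([-5, -2], -2)]

print(exact_match(candidate, reference))  # False: match-based metric rejects it
print(passes_tests(candidate, tests))     # True: execution-based metric accepts it
```

This mirrors the abstract's finding: the space of correct fixes is combinatorially large, so textual comparison to a single reference undercounts correct programs.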
Rationalization for Explainable NLP: A Survey
Recent advances in deep learning have improved the performance of many
Natural Language Processing (NLP) tasks such as translation,
question-answering, and text classification. However, this improvement comes at
the expense of model explainability. Black-box models make it difficult to
understand the internals of a system and the process it takes to arrive at an
output. Numerical (LIME, Shapley) and visualization (saliency heatmap)
explainability techniques are helpful; however, they are insufficient because
they require specialized knowledge. These factors led rationalization to emerge
as a more accessible explainable technique in NLP. Rationalization justifies a
model's output by providing a natural language explanation (rationale). Recent
improvements in natural language generation have made rationalization an
attractive technique because it is intuitive, human-comprehensible, and
accessible to non-technical users. Since rationalization is a relatively new
field, its literature is disorganized. This survey, the first on the topic,
analyzes rationalization literature in NLP from 2007 to 2022 and presents
available methods,
explainable evaluations, code, and datasets used across various NLP tasks that
use rationalization. Further, a new subfield in Explainable AI (XAI), namely,
Rational AI (RAI), is introduced to advance the current state of
rationalization. A discussion on observed insights, challenges, and future
directions is provided to point to promising research opportunities.
Data quality in the deep learning era: Active semi-supervised learning and text normalization for natural language understanding
Deep Learning, a growing sub-field of machine learning, has been applied with tremendous success in a variety of domains, opening opportunities for achieving human-level performance in many applications. However, Deep Learning methods depend on large quantities of data with millions of annotated instances. And while well-formed academic datasets have helped advance supervised learning research, in the real world we are deluged daily by massive amounts of unstructured data that remain unusable for current supervised learning approaches, as only a small portion is labeled, cleaned, or structured.
In order for a machine learning model to be effective, volume is not the only data dimension that matters. Quality is equally important and has proven to be a critical factor for the success of industrial applications of machine learning. According to IBM, poor data quality can cost more than 3 trillion US dollars per year for the US market alone. Inspired by the need for advanced methods that can efficiently address such bottlenecks, we develop machine learning techniques that can be leveraged to improve data quality in both data-related dimensions: the input and output space.
Having a set of labeled examples that can capture the task characteristics is one of the most important prerequisites for successfully applying machine learning. As such, we first focus on minimizing the annotation effort for any arbitrary user-defined task by exploring active learning methods. We show that the best-performing active learning strategy depends on the task at hand, and we propose a combination of active learners that maximizes annotation performance early in the process. We demonstrate the viability of the approach on several relation extraction tasks.
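One concrete strategy of the kind combined in such ensembles is least-confidence uncertainty sampling, which selects for annotation the unlabeled examples the current model is least sure about. A minimal sketch, with hypothetical class-probability outputs rather than a trained model:

```python
def least_confidence_sample(probabilities, k=2):
    """Select indices of the k unlabeled examples whose top-class
    probability is lowest, i.e. where the model is least confident."""
    confidences = [(max(p), i) for i, p in enumerate(probabilities)]
    confidences.sort()  # least confident first
    return [i for _, i in confidences[:k]]

# Hypothetical class-probability outputs for 4 unlabeled examples.
probs = [
    [0.95, 0.05],  # confident
    [0.55, 0.45],  # uncertain
    [0.80, 0.20],
    [0.51, 0.49],  # most uncertain
]
print(least_confidence_sample(probs, k=2))  # → [3, 1]
```

A combination of active learners, as proposed here, would pool several such selection criteria (e.g. margin- or entropy-based) rather than committing to one upfront.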
Next, we observe that even though our method can be used to speed up the collection of labeled training data, the rest of the data will remain unlabeled and thus unexploited. Semi-supervised learning methods proposed in the literature can utilize additional unlabeled data; however, they are typically compared on computer vision datasets such as CIFAR10. Here, we perform a systematic exploration of several semi-supervised methods for three sequence labeling tasks and two classification tasks.
Additionally, most methods have assumptions that are less suitable to realistic scenarios. For example, proposed methods in the recent literature treat all unlabeled examples equally. Yet, in many cases we would like to sort out examples that might be less useful or confusing, particularly in noisy settings where examples with low training loss or high confidence are more likely to be clean examples. In addition, most methods assume that the unlabeled data can be classified into the same classes as the labeled data. This does not take into consideration the very plausible scenario of out-of-class instances. For example, our classifier may be distinguishing cats from dogs, but the unlabeled examples may contain additional classes, such as shells, butterflies, etc. To this end, we design methods to mitigate these issues, with a re-weighting mechanism that can be incorporated into any consistency-based regularizer.
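A minimal sketch of such a re-weighting, assuming per-example confidences from the current model are available: high-confidence examples (more likely clean and in-class) contribute fully to the consistency loss, while low-confidence ones are suppressed. The hard threshold used here is illustrative only, not the mechanism from this work.

```python
def reweighted_consistency_loss(losses, confidences, threshold=0.7):
    """Weight each unlabeled example's consistency loss by model confidence,
    zeroing out likely-noisy or out-of-class examples below the threshold."""
    weights = [c if c >= threshold else 0.0 for c in confidences]
    total_w = sum(weights)
    if total_w == 0:
        return 0.0
    return sum(w * l for w, l in zip(weights, losses)) / total_w

# Hypothetical per-example consistency losses and model confidences.
losses = [0.2, 1.5, 0.4]
confidences = [0.9, 0.3, 0.8]  # the 0.3 example may be out-of-class
print(reweighted_consistency_loss(losses, confidences))
```

Because the weighting wraps only the per-example loss terms, it can be bolted onto any consistency-based regularizer (e.g. penalizing prediction changes under input perturbations) without altering the regularizer itself.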
Both active and semi-supervised learning methods aim to reduce labeling efforts by either automatically expanding the training set or selecting the most informative examples for human annotation. However, bootstrapping approaches often result in negative effects on NLP tasks due to the addition of falsely labeled instances. We address the challenge of producing good-quality proxy labels by leveraging the continuously growing stream of human annotations. We introduce a calibration of semi-supervised active learning where the confidence of the classifier is weighted by an auxiliary neural model that removes incorrectly labeled instances and dynamically adjusts the number of proxy labels included in each iteration. Experimental results show that our strategy outperforms baselines that combine traditional active learning with self-training.
We have explored various ways to improve the output space of examples. But the input representation is equally important. Particularly for social media, the most abundant source of raw data nowadays, informal writing can cause several bottlenecks. For example, most Information Extraction (IE) tools rely on accurate understanding of text and struggle with the noisy and informal nature of social media due to high out-of-vocabulary (OOV) word rates. In this work, we design a social media text normalization hybrid word-character attention-based encoder-decoder model that can serve as a pre-processing step for any off-the-shelf NLP tool to adapt to noisy social media text. Our model surpasses baseline neural models designed for text normalization and achieves comparable performance with state-of-the-art related work.
Although we evaluate on NLP tasks, all methods developed are fairly general and can be applied to other supervised machine learning tasks in need of techniques that create meaningful data representations and simultaneously reduce the burden and cost of human annotations.
Drink Bleach or Do What Now? Covid-HeRA: A Study of Risk-Informed Health Decision Making in the Presence of COVID-19 Misinformation
Given the widespread dissemination of inaccurate medical advice related to
the 2019 coronavirus pandemic (COVID-19), such as fake remedies, treatments and
prevention suggestions, misinformation detection has emerged as an open problem
of high importance and interest for the research community. Several works study
health misinformation detection, yet little attention has been given to the
perceived severity of misinformation posts. In this work, we frame health
misinformation as a risk assessment task. More specifically, we study the
severity of each misinformation story and how readers perceive this severity,
i.e., how harmful a message believed by the audience can be and what type of
signals can be used to recognize potentially malicious fake news and detect
refuted claims. To address our research questions, we introduce a new benchmark
dataset, accompanied by detailed data analysis. We evaluate several traditional
and state-of-the-art models and show there is a significant gap in performance
when applying traditional misinformation classification models to this task. We
conclude with open challenges and future directions.
Comment: Accepted to the AAAI ICWSM'22 Datasets Track.